AITopics | sparse autoencoder

Collaborating Authors

sparse autoencoder

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations

Neural Information Processing SystemsJun-22-2026, 18:13:20 GMT

Instruction tuning improves the ability of large language models (LLMs) to follow diverse human instructions, but achieving strong performance on specific target tasks remains challenging. A critical bottleneck is selecting the most relevant data to maximize task-specific performance. Existing data selection approaches include unstable influence-based methods and more stable distribution alignment methods, the latter of which critically rely on the underlying sample representation. In practice, most distribution alignment methods, from shallow features (e.g., BM25) to neural embeddings (e.g., BGE, LLM2Vec), may fail to capture how the model internally processes samples. To bridge this gap, we adopt a model-centric strategy in which each sample is represented by its neuronal activation pattern in the model, directly reflecting internal computation. However, directly using raw neuron activations leads to spurious similarity between unrelated samples due to neuron polysemanticity, where a single neuron may respond to multiple, unrelated concepts. To address this, we employ sparse autoencoders to disentangle polysemantic activations into sparse, monosemantic representations, and introduce a dedicated similarity metric for this space to better identify task-relevant data. Comprehensive experiments across multiple instruction datasets, models, tasks, and selection ratios show that our approach consistently outperforms existing data selection baselines in both stability and task-specific performance2.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia (0.93)
North America > United States > Minnesota (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

LLMInterpretability with Identifiable Temporal-Instantaneous Representation

Neural Information Processing SystemsJun-20-2026, 00:32:00 GMT

Despite Large Language Models' remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees--undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically-grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs' rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs' high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law (1.00)
Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Interpreting vision transformers via residual replacement model

Neural Information Processing SystemsJun-17-2026, 06:08:51 GMT

How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

Neural Information Processing Systems

Country: Europe (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment > Sports (0.67)
Health & Medicine > Diagnostic Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Neural Information Processing SystemsJun-16-2026, 02:36:45 GMT

Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have become a popular tool in interpretability literature. However, recent work has demonstrated phenomenology of model representations that lies outside the scope of this hypothesis, showing signatures of hierarchical, nonlinear, and multi-dimensional features. This raises the question: do SAEs represent features that possess structure at odds with their motivating hypothesis? If not, does avoiding this mismatch help identify said features and gain further insights into neural network representations? To answer these questions, we take a construction-based approach and re-contextualize the popular matching pursuit (MP) algorithm from sparse coding to design MP-SAE--an SAE that unrolls its encoder into a sequence of residual-guided steps, allowing it to capture hierarchical and nonlinearly accessible features. Comparing this architecture with existing SAEs on a mixture of synthetic and natural data settings, we show: (i) hierarchical concepts induce conditionally orthogonal features, which existing SAEs are unable to faithfully capture, and (ii) the nonlinear encoding step of MP-SAE recovers highly meaningful features, helping us unravel shared structure in the seemingly dichotomous representation spaces of different modalities in a vision-language model, hence demonstrating the assumption that useful features are solely linearly accessible is insufficient. We also show that the sequential encoder principle of MPSAE affords an additional benefit of adaptive sparsity at inference time, which may be of independent interest. Overall, we argue our results provide credence to the idea that interpretability should begin with the phenomenology of representations, with methods emerging from assumptions that fit it.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (0.92)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Neural Information Processing SystemsJun-14-2026, 20:41:30 GMT

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable--switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability.

artificial intelligence, machine learning, sae, (16 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Measuring and Guiding Monosemanticity

Neural Information Processing SystemsJun-14-2026, 07:33:22 GMT

There is growing interest in leveraging mechanistic interpretability and controllability to better understand and influence the internal dynamics of large language models (LLMs). However, current methods face fundamental challenges in reliably localizing and manipulating feature representations. Sparse Autoencoders (SAEs) have recently emerged as a promising direction for feature extraction at scale, yet they, too, are limited by incomplete feature isolation and unreliable monosemanticity. To systematically quantify these limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric to quantify feature monosemanticity in latent representation. Building on these insights, we propose Guided Sparse Autoencoders (G-SAE), a method that conditions latent representations on labeled concepts during training. We demonstrate that reliable localization and disentanglement of target concepts within the latent space improve interpretability, detection of behavior, and control. Specifically, our evaluations on toxicity detection, writing style identification, and privacy attribute recognition show that G-SAE not only enhances monosemanticity but also enables more effective and fine-grained steering with less quality degradation. Our findings provide actionable guidelines for measuring and advancing mechanistic interpretability and control of LLMs.

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.84)

Add feedback

Projecting Assumptions: The Duality Between Sparse Autoencoders and Concept Geometry

Neural Information Processing SystemsJun-10-2026, 10:32:57 GMT

Sparse Autoencoders (SAEs) are widely used to interpret neural networks by identifying meaningful concepts from their representations. However, do SAEs truly uncover all concepts a model relies on, or are they inherently biased toward certain kinds of concepts? We introduce a unified framework that recasts SAEs as solutions to a bilevel optimization problem, revealing a fundamental challenge: each SAE imposes structural assumptions about how concepts are encoded in model representations, which in turn shapes what it can and cannot detect. This means different SAEs are not interchangeable -- switching architectures can expose entirely new concepts or obscure existing ones. To systematically probe this effect, we evaluate SAEs across a spectrum of settings: from controlled toy models that isolate key variables, to semi-synthetic experiments on real model activations and finally to large-scale, naturalistic datasets. Across this progression, we examine two fundamental properties that real-world concepts often exhibit: heterogeneity in intrinsic dimensionality (some concepts are inherently low-dimensional, others are not) and nonlinear separability. We show that SAEs fail to recover concepts when these properties are ignored, and we design a new SAE that explicitly incorporates both, enabling the discovery of previously hidden concepts and reinforcing our theoretical insights. Our findings challenge the idea of a universal SAE and underscores the need for architecture-specific choices in model interpretability. Overall, we argue an SAE does not just reveal concepts -- it determines what can be seen at all.

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.62)

Add feedback

Feature Starvation as Geometric Instability in Sparse Autoencoders

Chaudhry, Faris, Yano, Keisuke, Monod, Anthea

arXiv.org Machine LearningMay-8-2026

Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.

feature starvation, large language model, machine learning, (18 more...)

arXiv.org Machine Learning

2605.05341

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

A Unifying Framework for Unsupervised Concept Extraction

Squires, Chandler, Ravikumar, Pradeep

arXiv.org Machine LearningApr-29-2026

Techniques for concept extraction, such as sparse autoencoders and transcoders, aim to extract high-level symbolic concepts from low-level nonsymbolic representations. When these extracted concepts are used for downstream tasks such as model steering and unlearning, it is essential to understand their guarantees, or lack thereof. In this work, we present a unified theoretical framework for unsupervised concept extraction, in which we frame the task of concept extraction as identifying a generative model. We present a general meta-theorem for identifiability, which reduces the problem of establishing identifiability guarantees to the problem of characterizing the intersection of two sets. As we demonstrate on a range of widely-used approaches, this meta-theorem substantially simplifies the task of proving such guarantees, thus paving the way for the development of new, principled approaches for concept extraction.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Machine Learning

2604.24936

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology: